Self-Supervised Anomaly Detection from Anomalous Training Data via Iterative Latent Token Masking

Anomaly detection and segmentation are important tasks across sectors ranging from medical image analysis to industrial quality control. However, current unsupervised approaches require training data that contains no anomalies, a requirement that can be especially challenging in many medical imaging scenarios. In this paper, we propose Iterative Latent Token Masking, a self-supervised framework derived from a robust statistics point of view, translating iterative model fitting with M-estimators to the task of anomaly detection. This allows unsupervised methods to be trained on datasets heavily contaminated with anomalous images. Our method builds on prior work that combines Transformers with a Vector Quantized-Variational Autoencoder for anomaly detection, an approach with state-of-the-art performance when trained on normal (non-anomalous) data. More importantly, we utilise the token masking capabilities of Transformers to filter out suspected anomalous tokens from each sample's sequence in the training set in an iterative self-supervised process, thus overcoming the difficulties of highly anomalous training data. Our work also highlights shortfalls in current state-of-the-art self-supervised, self-trained and unsupervised models when faced with small proportions of anomalous training data. We evaluate our method on whole-body PET data and also demonstrate its wider applicability to more common computer vision tasks, such as the industrial MVTec dataset. Using varying levels of anomalous training data, our method shows superior performance over several state-of-the-art models, drawing attention to the potential of this approach.

A. VQ-VAE Implementation
The architecture used for the VQ-VAE model for the MVTec dataset consisted of an encoder with three downsampling layers, each containing a convolution with stride 2 and kernel size 4 followed by a ReLU activation and 3 residual blocks. Each residual block consists of a convolution with kernel size 3, followed by a ReLU activation, a convolution with kernel size 1 and another ReLU activation. Mirroring the encoder, the decoder has 3 layers of 3 residual blocks, each followed by a transposed convolutional layer with stride 2 and kernel size 4. Finally, before the last transposed convolutional layer, a Dropout layer with a probability of 0.05 is added. The VQ-VAE codebook with optimal reconstruction was found to have 256 atomic elements (vocabulary size), each of length 32. To train the VQ-VAE network, we used an ADAM optimiser with a learning rate of 1e-4 and an exponential learning rate decay with a gamma of 0.9999. Training was run for 20,000 epochs with a batch size of 16. During training, the data was augmented with Gaussian noise, contrast adjustment, intensity shifts, translations, rotations and scaling. The VQ-VAE model used for the 3D PET data consisted of the same architecture with 3D kernels. Additionally, its codebook has a vocabulary size of 256, with each element of length 128. This model was trained for 1,000 epochs with a batch size of 3. In addition to the augmentations used for the MVTec dataset, elastic deformations were also carried out during training.
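For concreteness, the following PyTorch sketch illustrates the encoder/decoder layer pattern described above for the MVTec configuration. Channel widths and padding values are illustrative assumptions rather than the exact implementation, and the vector-quantisation layer between encoder and decoder is omitted for brevity.

```python
# Minimal sketch of the described VQ-VAE encoder/decoder (MVTec configuration).
# Only strides, kernel sizes, block counts, dropout rate and embedding size
# follow the text; channel widths and padding are illustrative assumptions.
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.ReLU(),
        )

    def forward(self, x):
        return x + self.block(x)

def make_encoder(in_ch=3, hidden=128, embed_dim=32):
    layers, ch = [], in_ch
    for _ in range(3):                                     # three downsampling stages
        layers += [nn.Conv2d(ch, hidden, kernel_size=4, stride=2, padding=1), nn.ReLU()]
        layers += [ResidualBlock(hidden) for _ in range(3)]
        ch = hidden
    layers += [nn.Conv2d(hidden, embed_dim, kernel_size=1)]  # project to codebook dimension
    return nn.Sequential(*layers)

def make_decoder(out_ch=3, hidden=128, embed_dim=32):
    layers = [nn.Conv2d(embed_dim, hidden, kernel_size=1)]
    for i in range(3):                                     # three upsampling stages
        layers += [ResidualBlock(hidden) for _ in range(3)]
        if i == 2:
            layers += [nn.Dropout(0.05)]                   # dropout before the last transposed conv
        out = out_ch if i == 2 else hidden
        layers += [nn.ConvTranspose2d(hidden, out, kernel_size=4, stride=2, padding=1)]
        if i < 2:
            layers += [nn.ReLU()]
    return nn.Sequential(*layers)
```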

B. Transformer Implementation
Once the VQ-VAE model was trained and the training data could be encoded, a Transformer could then be trained on the encoded images, using their discrete latent representations. The self-attention mechanism is best described as a mapping of intermediate representations of three position-wise linear layers onto three representations denoted by the Value (V), Key (K) and Query (Q) [28]. With $d_k$ denoting the dimension of the key, query and value vectors, the attention mechanism is calculated as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

The Transformer's success relies on the self-attention mechanism employed to capture the interactions between inputs in the sequence regardless of their relative position to one another. This mechanism relies on the inner product between elements of the sequence, and as such the network scales quadratically with sequence length. This is a key limitation when applied to image data. In this work, we use the Performer variant, which proposes a linear generalized attention offering a more scalable approach [7]. The Performer makes use of multi-headed self-attention, in which several attention layers are run in parallel, with their outputs concatenated and fed through a linear layer.
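As a point of reference, the sketch below implements the standard scaled dot-product attention given in the equation above, assuming batched tensors of shape (batch, sequence length, $d_k$). The explicit softmax over the full score matrix is what incurs the quadratic cost; the Performer replaces it with a kernel-based linear-time approximation.

```python
# Standard scaled dot-product attention (quadratic in sequence length).
# The Performer avoids materialising the (seq, seq) score matrix.
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)
    return weights @ v
```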
The Performer used in this work for the MVTec dataset corresponds to a decoder Transformer architecture with 16 layers, each with 8 heads, and an embedding size of 32. For the PET data, the Performer had 14 layers, each with 8 heads, and an embedding size of 256. To train the network, we used an ADAM optimiser with a learning rate of 1e-3, an exponential learning rate decay with a gamma of 0.9999 and a cross-entropy loss. Furthermore, the embedding, feedforward and attention mechanisms within the network all had a dropout probability of 0.1.
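The following sketch shows the optimiser, scheduler and loss setup described above for next-token prediction on the latent token sequences. The stand-in decoder and the training step are illustrative placeholders, not the Performer implementation itself; only the optimiser, learning rate, decay factor and loss follow the text.

```python
# Hedged sketch of the Transformer training setup: ADAM (lr 1e-3), exponential
# decay (gamma 0.9999) and cross-entropy over next-token predictions.
import torch
import torch.nn as nn

vocab_size = 256                                    # VQ-VAE codebook size

class StandInDecoder(nn.Module):                    # placeholder for the Performer decoder
    def __init__(self, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):                      # (batch, seq_len) -> (batch, seq_len, vocab)
        return self.head(self.embed(tokens))

model = StandInDecoder()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimiser, gamma=0.9999)
criterion = nn.CrossEntropyLoss()

def train_step(tokens):
    """One autoregressive training step on a batch of latent token sequences."""
    logits = model(tokens[:, :-1])                  # predict the next token at each position
    loss = criterion(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    scheduler.step()
    return loss.item()
```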
Given the recursive nature of the training, which requires inference on the training set to generate the latent code masks, there is a risk of overfitting to the original training data, which would degrade anomaly detection performance on that data. To avoid this, the data fed into the Transformer during training was augmented using the same data augmentations used in VQ-VAE training, i.e. Gaussian noise, contrast adjustment, intensity shifts, vertical and horizontal translations, rotations and scaling (and elastic deformations in the case of the PET data). Training was then performed over 80 epochs with a batch size of 1, with an example augmentation pipeline sketched below.
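The sketch below illustrates such an augmentation pipeline, assuming the MONAI transform library; the probabilities and parameter ranges are illustrative assumptions, and only the transform types follow the text.

```python
# Illustrative augmentation pipeline (2D MVTec case) using MONAI transforms.
# Probabilities and ranges are assumptions; for the PET data, an elastic
# deformation transform would additionally be appended.
from monai.transforms import (Compose, RandGaussianNoise, RandAdjustContrast,
                              RandShiftIntensity, RandAffine)

augment = Compose([
    RandGaussianNoise(prob=0.5),                       # Gaussian noise
    RandAdjustContrast(prob=0.5),                      # contrast adjustment
    RandShiftIntensity(offsets=0.1, prob=0.5),         # intensity shifts
    RandAffine(prob=0.5,
               translate_range=(4, 4),                 # vertical and horizontal translations (pixels)
               rotate_range=0.1,                       # rotations (radians)
               scale_range=(0.1, 0.1)),                # scaling
])
# Each image is augmented before encoding, so the Transformer never sees the
# exact latent sequences on which the anomaly masks are computed.
```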

Table 2. Anomaly detection results of the proposed method in comparison to state-of-the-art methods. For each dataset category, AUROC (top row) and AUPRO (bottom row) are given along with the respective standard deviation. The best-performing method is highlighted in boldface.